Abstract



As file systems are increasingly being deployed on ever 
larger systems with many cores and multi-gigabytes 
of memory, scaling the internal data structures of file 
systems has taken greater importance and urgency. A 
doubly-linked list is a simple and very commonly used 
data structure in file systems but it is not very friendly to 
multi-threaded use. While special cases of lists, such as 
queues and stacks, have lock-free versions that scale rea- 
sonably well, the general form of a doubly-linked list of- 
fers no such solution. Using a mutex to serialize all oper- 
ations remains the de-facto method of maintaining a dou- 
bly linked list. This severely limits the scalability of the 
list and developers must resort to ad-hoc workarounds 
that involve using multiple smaller lists (with individual 
locks) and deal with the resulting complexity of the sys- 
tem. 

In this paper, we present an approach to building highly 
concurrent data structures, with special focus on the im- 
plementation of highly concurrent doubly-linked lists. 
Dubbed "advanced doubly-linked list" or "adlist" for 
short, our list allows iteration in any direction, and in- 
sert/delete operations over non-overlapping nodes to ex- 
ecute in parallel. Operations with common nodes get se- 
rialized so as to always present a locally consistent view 
to the callers. An adlist node needs an additional 8 bytes 
of space for keeping synchronization information - this 
is only possible due to the use of light-weight synchro- 
nization primitives called lwlocks which are optimized 
for small memory footprint. A read-write lwlock occu- 
pies 4 bytes, a mutex occupies 4 bytes, and a condition 
variable occupies 4 bytes. The corresponding primitives 
of the popular pthread library occupy 56, 40 and 48 bytes 
respectively on the x86-64 platform. The Data Domain 
File System makes extensive use of adlists which has al- 
lowed for significant scaling of the system without sacri- 
ficing simplicity. 



1 Introduction 

The simplest data structures often get used the most. 
This is particularly true for linked lists which are used 
in a wide variety of ways in most file systems. Linked
lists can be used to hold incoming I/O requests, to build
simple LRU caches, and to maintain work queues or other
FIFO data structures. While simple and useful, a generic
doubly-linked list does not scale well with respect to con- 
current access. 

As hardware platforms support an increasing number of
cores, having basic data structures be as concurrent as
possible is now a prerequisite [10]. There have been
good strides in building lock-free or highly-concurrent
versions of simpler variants of lists, such as LIFO and
FIFO queues [3] [7], singly-linked lists [3],
double-ended queues [11], and many more variants [1]. There
are a number of published lock-free approaches both
for any data structure in general [2] [6] and for generic
doubly-linked lists in particular [11] [8]. All lock-free
approaches, however, either do not support true iterators,
or implicitly rely on the system having strong
garbage collection: otherwise, an iterator having read the
next pointer from a node will have no guarantee that the 
next node will be valid when the pointer is dereferenced. 
Some lock-free approaches are also not entirely practical 
owing to their reliance on hardware features not com- 
monly available or the complexity of the approaches. 
Due to these limitations, we do not presently consider 
these lock-free approaches viable for our needs in build- 
ing file systems. 

The default approach, therefore, for protecting a generic 
doubly-linked list in a multi-threaded environment with 
no special system support remains the use of a single lock 
to protect the entire list during any supported operation. 
Attempts to increase concurrent access by using localized
locks usually run afoul of one or more of the following
issues: (i) the memory overhead is too high when
using per-node locks, (ii) the need to enforce "lock ordering"
restricts the supportable API, or (iii) the supported API
is prone to spinning behavior and restricts member nodes
to preallocated memory.

In this paper, we present our approach to building highly- 
concurrent doubly-linked lists. While we focus primar- 
ily on the list data structure, the general ideas presented 
are applicable to other data structures as well. We call 
our highly-concurrent doubly-linked list an "advanced 
doubly-linked list" or "adlist" for short. An adlist sup- 
ports iteration (forward or backward), insert before/after 
any node (or at head/tail of list) and delete of any speci- 
fied node (or dequeue/pop from list). It also places no re- 
strictions on the number of nodes a list can have or where 
the memory for the nodes comes from. The nodes can be 
dynamically allocated and returned to the system mem- 
ory once removed from the list. In short, it supports most 
of the operations that one would expect to perform on 
a generic doubly-linked list. Operations that involve dis- 
joint sets of nodes proceed in parallel. Operations with 
common nodes get serialized so as to always present a 
locally consistent view to the callers. The operations are 
not lock-free but threads do not spin when contending on 
a node. 

The low memory overhead per node is possible due to 
the use of "lwlock" [4] primitives. The "asynchronous"
or "deferred blocking" mode of lwlocks also provides
the crucial framework for dealing with lock ordering issues.
The synchronization structures add an overhead of
8 bytes per node. The adlist was developed for and is
used extensively in the Data Domain File System [12].
We have found it to lower contention and improve scala- 
bility of our system significantly without sacrificing sim- 
plicity of code. 

The rest of the paper is organized as follows: Section 2
summarizes the basic idea behind lwlocks and
asynchronous locking as used in adlists. Section 3 describes
the internal structure of the synchronization primitive
of an adlist node. Section 4 walks through the APIs
supported by an adlist, highlighting how we maintain
correctness. In section 5 we discuss our observations and
experience with using adlists and touch upon possible
extensions. In section 6 we present an experimental
evaluation of adlist performance. Finally, in section 7 we
present our conclusions.



2 Light-weight locks 

Light-weight locks [4], or lwlocks, are synchronization
primitives that are optimized for small memory footprint 



// Assume that each bool_t takes 4 bytes, each
// pthread_mutex_t takes 40 bytes, each pthread_cond_t
// takes 48 bytes, and each function pointer takes 8 bytes.
interface event_t {
    void signal();
    void wait();
    bool_t poll();
} // 24 bytes

interface domain_t {
    waiter_t alloc_waiter();
    void free_waiter(waiter_t waiter);
    waiter_t get_waiter();
    waiter_t id2waiter(uint16_t id);
} // 32 bytes

struct waiter_t {
    event_t event;
    domain_t domain;
    bool_t signal_pending;
    bool_t waiter_waiting;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    uint64_t app_data;
    uint16_t id;
    uint16_t next;
    uint16_t prev;
} // 166 bytes

Figure 1: Definition of a waiter structure.



while maintaining performance comparable to pthread 
library's synchronization primitives in low contention 
cases. A read-write lwlock occupies 4 bytes, a mutex oc- 
cupies 4 bytes (2 if deadlock detection is not required), 
and a condition variable occupies 4 bytes. The corre- 
sponding primitives of the popular pthread library oc- 
cupy 56 bytes, 40 bytes and 48 bytes respectively on the 
x86-64 platform. 

The core idea behind lwlocks is the observation that 
while a thread could block on different locks or wait on 
many different condition variables in its lifetime, it can 
block on only one lock or condition variable at any given 
point. With lwlocks, whenever a thread has to block, it 
uses a "waiter" structure to do so. Each thread has its 
own waiter structure and can access it by invoking the 
tls_get_waiter function (which returns the pointer 
to the waiter kept in the thread local storage). 

Figure 1 presents the definition of a waiter structure. For
compact representation, the maximum number of waiter
structures is limited to be less than 2^16 so that each structure
can also be uniquely referred to by a 16-bit number.
The value 2^16 - 1 is used to represent the NULL waiter
structure and is denoted by NULLID. The limit on the number
of waiters, and hence on the number of threads, is large 
enough for most applications and certainly so for use in 
adlists. 

A waiter structure is assigned to a thread the first time 
the thread accesses it (via tls_get_waiter) and the 
structure is returned to the pool of free waiter structures
when the thread exits to be re-used by a later thread. The 
waiter structure is the key piece that enables the compact 
nature of lwlocks. It can also be used to create other cus- 
tom compact lock-like data structures, a feature that is 
used in adlists (see section 3).

Waiter's Event. A generic event interface underlies the 
actual mechanics that are used by a thread when it blocks 
or unblocks on a lock. The two main operations defined 
for an event are: (i) wait, which is called to wait for 
the event to trigger; and (ii) signal, which informs a 
waiter of an event getting triggered. A waiter structure 
uses one pthread mutex and one condition variable to im- 
plement both operations. The operation wait blocks the 
thread on the condition variable until a signal arrives. The 
operation signal wakes up the blocked thread. Like 
semaphores, the implementation ensures that a signal on
an event cannot be lost, i.e., signal can be invoked
before the matching wait, and the wait will then find the
pending signal. Unlike a semaphore, however, the operations
wait and signal are always called in pairs.
There is also a third operation called poll. It can be 
used to check if a signal is already pending. 
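
As a minimal sketch of the event semantics described above, the three operations could be built from one pthread mutex, one condition variable and a pending flag. The names event_init, event_signal, event_wait and event_poll are hypothetical and chosen only for illustration; the actual lwlock implementation may differ.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event: one mutex, one condition variable and a
 * "signal pending" flag, mirroring the fields listed in Figure 1. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            signal_pending;
} event_t;

void event_init(event_t *e) {
    pthread_mutex_init(&e->mutex, NULL);
    pthread_cond_init(&e->cond, NULL);
    e->signal_pending = false;
}

/* Record the signal and wake the waiter if it is already blocked. */
void event_signal(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    assert(!e->signal_pending);      /* wait and signal come in pairs */
    e->signal_pending = true;
    pthread_cond_signal(&e->cond);
    pthread_mutex_unlock(&e->mutex);
}

/* Block until a signal is pending, then consume it. */
void event_wait(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    while (!e->signal_pending)
        pthread_cond_wait(&e->cond, &e->mutex);
    e->signal_pending = false;
    pthread_mutex_unlock(&e->mutex);
}

/* Report whether a signal is pending, without consuming it. */
bool event_poll(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    bool pending = e->signal_pending;
    pthread_mutex_unlock(&e->mutex);
    return pending;
}
```

Because the flag is recorded under the mutex, a signal that arrives before the matching wait is not lost: the later wait finds the flag set and returns without blocking.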

Forming Lists or Stacks of Waiters. Each waiter struc- 
ture records its own id. It also has space for previous and 
next id values which can be used to form stacks or lists 
of waiters. Such a list (or stack) of waiters can be identified
purely by the id of the first element of the list, i.e.,
it can be represented by a 16-bit value. To go to the next 
(previous) waiter, we convert the current id to the corre- 
sponding waiter structure and look at the next (previous) 
id field in it. 
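
The id-based linking can be sketched as follows, assuming a small fixed pool of waiters indexed by id. The pool, its size NWAITERS, and the helpers pool_init, stack_push and stack_size are illustrative assumptions; only id2waiter and NULLID are names taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define NULLID   UINT16_MAX      /* 2^16 - 1 denotes "no waiter" */
#define NWAITERS 8               /* tiny pool, for illustration only */

typedef struct {
    uint16_t id;
    uint16_t next;
    uint16_t prev;
} waiter_t;

static waiter_t pool[NWAITERS];

/* Give every waiter its own id and detach it from any list. */
void pool_init(void) {
    for (uint16_t i = 0; i < NWAITERS; i++) {
        pool[i].id = i;
        pool[i].next = NULLID;
        pool[i].prev = NULLID;
    }
}

/* Map a 16-bit id back to its waiter structure. */
waiter_t *id2waiter(uint16_t id) {
    assert(id < NWAITERS);
    return &pool[id];
}

/* Push waiter `id` onto the stack whose top id is stored in `*top`. */
void stack_push(uint16_t *top, uint16_t id) {
    id2waiter(id)->next = *top;
    *top = id;
}

/* Count the waiters reachable from `top` through the next ids. */
uint16_t stack_size(uint16_t top) {
    uint16_t count = 0;
    while (top != NULLID) {
        top = id2waiter(top)->next;
        count++;
    }
    return count;
}
```

The whole stack is named by the single 16-bit value held in `top`, which is what lets a lock embed a queue of waiters in 2 bytes.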

Locking Data. The final important piece of a waiter 
structure is the space it provides that can be used by 
the abstractions built on top for their own purpose. The 
waiter itself does not interpret it in any way. For in- 
stance, read-write lwlocks use this space to record the 
type of locking operation that the thread was performing 
when it blocked: whether it was taking a read or write 
lock. This space amounts to 8 bytes and is referred to as 
app_data. 

2.1 Reader-writer lwlock

Adlists use a reader-writer lwlock in each adlist node and
it accounts for 4 of the 8 bytes of synchronization
space. Henceforth, we refer to a reader-writer lwlock as
simply an lwlock. An lwlock by default is "fair": the lock
is acquired in FIFO order by the threads blocked on it 
and a thread that wants to acquire an lwlock will block 
if there are already other threads waiting to acquire the 



lock. Pthread locks are not fair in this sense, and although 
it is possible to build lwlocks to mimic the same behav- 
ior, we have found fairness to be the better option in gen- 
eral. Adlists, in particular, rely on the fairness character- 
istics of lwlocks. 

An lwlock uses 2 bytes to keep a queue of waiter struc- 
tures of the threads that have requested the lock which 
is currently held. This queue is aptly called a waitq. 
The waitq is maintained as a "reverse list" as that al- 
lows insertion of a new waiter in a single hardware sup- 
ported compare-and-swap (CAS) instruction. The next 
field of a waiter structure holds the id of the waiter struc- 
ture in front of it. The oldest waiter's next field holds 
NULLID. Another important feature of fair lwlocks is
that lock ownership is transferred by the owner during an 
unlock. The unlocking thread has to walk the waitq 
to find the waiter(s) to signal. It removes the selected 
waiter(s) from the waitq before calling signal. At 
any point there can be only one thread performing the 
transfer on a lock and hence the walk and waitq modifi- 
cation is safe to perform. Since the unlocking thread does 
the work of transferring the lock state and ownership, the 
thread corresponding to a waiter can and should assume 
it has the lock once signal is called on the waiter. 

Of the remaining 16 bits of an lwlock, 14 bits are used 
for the count of read locks granted, 1 bit is used to in- 
dicate a write lock, and the final 1 bit is used to indicate 
whether the lock is read-biased or not. A read-biased lock 
is unfair towards writers in the sense that a thread that 
needs a read lock will acquire it without any regard to 
waiting writers if the lock is already held by other read- 
ers. This behavior is similar to that of pthread read-write 
lock and is essential for applications where a thread can 
recursively acquire the same lock as a reader. Without the 
read-biased behavior, a deadlock can result if a writer ar- 
rives in between two read lock acquisitions: the second 
read lock attempt will wait for the writer which is wait- 
ing for the first read lock to be released. Applications that 
do not have recursive read locking do not need the read- 
biased behavior but may choose to use it for throughput 
reasons.¹ Adlists do not use read-biased lwlocks and we
do not discuss them further in this paper. 
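
To make the 32-bit budget concrete, the following sketch packs the four fields into one word. The text fixes only the field widths (16, 14, 1 and 1 bits); the bit positions chosen here and the lw_ helper names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (positions are illustrative, not from the paper):
 * bits 0-15 waitq id, bits 16-29 reader count,
 * bit 30 wlocked, bit 31 rd_bias. */
#define WAITQ_MASK    0xFFFFu
#define READERS_SHIFT 16
#define READERS_MAX   0x3FFFu          /* 14 bits */
#define WLOCKED_BIT   (1u << 30)
#define RDBIAS_BIT    (1u << 31)

/* Combine the four fields into a single 4-byte lock word. */
uint32_t lw_pack(uint16_t waitq, uint16_t readers, int wlocked, int rd_bias) {
    assert(readers <= READERS_MAX);
    return (uint32_t)waitq
         | ((uint32_t)readers << READERS_SHIFT)
         | (wlocked ? WLOCKED_BIT : 0u)
         | (rd_bias ? RDBIAS_BIT : 0u);
}

uint16_t lw_waitq(uint32_t w)   { return (uint16_t)(w & WAITQ_MASK); }
uint16_t lw_readers(uint32_t w) { return (uint16_t)((w >> READERS_SHIFT) & READERS_MAX); }
int      lw_wlocked(uint32_t w) { return (w & WLOCKED_BIT) != 0; }
int      lw_rd_bias(uint32_t w) { return (w & RDBIAS_BIT) != 0; }
```

Keeping the whole state in one word is what allows a single CAS to update the reader count, the writer flag and the waitq head together.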

Figure 2 outlines the algorithms for the two main operations:
(i) lock, and (ii) unlock. To acquire a lock, a
thread uses the CAS instruction to either take ownership
of the lock or add its own waiter structure to the lock's
waitq. If the lock is acquired, nothing more needs to be
done. If it is not acquired, then the thread has to wait
for the signal to arrive. We note that the lock operation
is based on the novel asynchronous locking functionality
of lwlocks, which is described in the following section.
The unlock operation has to pick the oldest set of waiters
that it can signal: either a single writer or a set of contiguous
readers, and change the lock state accordingly.
The algorithm outlined in figure 2 is only at a high level
for the contention case on a non-read-biased lwlock. The
uncontended case is simple to derive. We encourage interested
readers to get all the details about the workings
of all the lwlock primitives in [4].

¹ The 14-bit reader count limits the maximum number of readers
per lock to 2^14, a limit that we have found to be sufficient in practice
for adlists. The limit can be raised by having the API explicitly flag
read-bias behavior, so that the bias bit does not have to be in the lwlock;
by restricting the maximum concurrency, thereby freeing bits from the
waitq; or by slightly increasing the size of the lock.



2.2 Asynchronous locking 

A key observation about lwlock's locking behavior is that 
once a thread has put itself on the waitq of a lwlock, 
it is guaranteed to have the lock transferred to it. The 
thread does not have to call wait right away. It has to 
call wait eventually but it can perform other actions be- 
fore calling wait. Upon returning from wait, a thread 
can assume that it holds the lock in the mode it requested. 
This observation is reflected in the pseudo-code in figure 2
and forms the basis of asynchronous locking.

The asynchronous locking functionality of lwlocks is im- 
plemented by the async_lock function. If the lock 
could not be acquired, it does not call wait itself but 
returns a boolean so the caller knows if it should. Calling 
async_lock is similar to calling a trylock except 
the caller is then guaranteed to have the lock assigned 
to it at some future point. The caller must eventually 
call wait to acknowledge the ownership and once done, 
unlock the lock. Since each thread has only 1 waiter 
structure, once a call to async_lock returns FALSE, 
the thread must not perform any action that would re- 
sult in the structure being used again before it has called 
wait. Failure to adhere to this rule will most likely lead 
to one or more of system hang, corruption or crash. 
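
The acquire-or-enqueue step can be sketched with C11 atomics as follows. This is a sketch under stated assumptions rather than the lwlock source: the bit positions are assumed, the read-bias bit is omitted because adlists do not use it, and the caller's waiter is passed explicitly instead of being fetched from thread-local storage.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID       0xFFFFu
#define READER_ONE   (1u << 16)        /* readers assumed in bits 16-29 */
#define READERS_MASK (0x3FFFu << 16)
#define WLOCKED      (1u << 30)

typedef struct { uint16_t id, next; bool exclusive; } waiter_t;
typedef _Atomic uint32_t lwlock_t;     /* low 16 bits: waitq head id */

void lwlock_init(lwlock_t *l) { atomic_init(l, NULLID); }

/* Returns true if the lock was taken, false if `w` was queued instead.
 * After a false return the caller owes exactly one wait on its waiter. */
bool async_lock(lwlock_t *l, waiter_t *w, bool exclusive) {
    uint32_t o = atomic_load(l), n;
    bool queued;
    do {
        uint16_t waitq = (uint16_t)(o & 0xFFFFu);
        queued = false;
        if (!exclusive && !(o & WLOCKED) && waitq == NULLID) {
            assert((o & READERS_MASK) != READERS_MASK); /* count is full */
            n = o + READER_ONE;                 /* one more reader */
        } else if (exclusive && !(o & (WLOCKED | READERS_MASK))) {
            n = o | WLOCKED;                    /* lock was free */
        } else {                                /* need to block */
            w->exclusive = exclusive;
            w->next = waitq;                    /* reverse list */
            n = (o & ~0xFFFFu) | w->id;
            queued = true;
        }
    } while (!atomic_compare_exchange_weak(l, &o, n));
    return !queued;
}
```

A blocking lock is then just `if (!async_lock(...)) wait on the waiter's event`, while an asynchronous caller is free to do other work between the false return and the wait.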

Asynchronous locking is the key enabler to work around 
the constraints that lock-ordering imposes. An adlist al- 
lows concurrent appends, dequeues, inserts (before or af- 
ter any member), deletes and iterators (in either direc- 
tion). Some of these operations need to acquire locks in 
opposite order of other operations. To avoid deadlocks, a 
canonical order is picked (follow the next pointer) and 
operations that need to acquire locks in the opposite di- 
rection (along prev pointer) use asynchronous locking. 



struct lwlock_t {
    uint1_t  rd_bias;
    uint1_t  wlocked;
    uint14_t readers;
    uint16_t waitq;
}

bool_t async_lock(lwlock_t l, bool_t exclusive) {
    w = tls_get_waiter();
    do {
        o = n = l;
        if (!exclusive && !o.wlocked &&
            (o.waitq == NULLID ||
             o.rd_bias))
            n.readers++;
        else if (exclusive &&
                 !(n.wlocked || n.readers > 0))
            n.wlocked = 1;
        else { // Need to block
            w.app_data = exclusive;
            w.next = o.waitq;
            n.waitq = w.id;
        }
    } while (!CAS(l, o, n));
    if (n.waitq == w.id) return FALSE;
    else return TRUE;
}

void lock(lwlock_t l, bool_t exclusive) {
    if (!async_lock(l, exclusive)) w.event.wait();
    return;
}

uint16_t waitq_size(uint16_t wid) {
    uint16_t count = 0;
    while (wid != NULLID) {
        wid = id2waiter(wid).next;
        count++;
    }
    return count;
}

void unlock_fair(lwlock_t l) {
    do {
        o = n = l;
        if (n.wlocked == 1) n.wlocked = 0;
        else n.readers--;
        if (!(n.wlocked || n.readers > 0)) {
            (pw, wtw) = find_oldest_set_of_waiters(n);
            if (pw == NULL) n.waitq = NULLID;
            if (wtw.app_data != exclusive) {
                n.readers = waitq_size(wtw.id);
            } else // single writer picked
                n.wlocked = 1;
        }
    } while (!CAS(l, o, n));
    if (pw != NULL) pw.next = NULLID;
    wake_up_waiters(wtw);
}

Figure 2: Operations to lock and unlock a lwlock. The 
lock operation takes a boolean as input to indicate 
whether an exclusive lock is requested. The old and new 
values passed in to CAS are denoted by o and n, respec- 
tively. The caller's thread local waiter structure is de- 
noted by w. We use wtw and pw to denote the waiter 
to wake up, and the waiter before wtw in the waitq re- 
spectively. 
Using asynchronous locking for adlists

The following example illustrates how asynchronous 
locking is used and why it is essential. Suppose the 
canonical order for nodes A & B is A, then B. A thread 
holds a lock on B already and needs to lock A. It will 
make an asynchronous lock call for A. If the thread is 
unable to get the lock, it is on A's waitq, and it releases
the lock on B. It then waits for the lock on A to be granted 
and then reacquires the lock on B (which is in canonical 
order). In the above sequence, the thread always either 
holds a lock (on A or B) or is in the waitq of a lock 
(on A). Other guarantees in adlist implementation ensure 
that in this case A and B will remain valid and hence 
there will be no illegal access. Achieving this without 
asynchronous locking is not possible. Using trylock
on A and upon failure, releasing B then locking A leaves 
a window open between release of B and locking of A 
where neither node is in any way aware of the thread. 
One or both nodes could go away in that window and the 
thread would end up performing an illegal access. 
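
The control flow of this example can be sketched with stand-in locks that only record the calls made on them. Everything here (fake_lock_t, the fake_ functions, lock_trace and lock_against_order) is hypothetical scaffolding used to show the order of operations; a real lwlock would block in the wait step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-in lock: `contended` decides whether async_lock succeeds at once. */
typedef struct { char name; bool contended; } fake_lock_t;

char lock_trace[64];   /* log of the operations, in call order */

void record(const char *op, const fake_lock_t *l) {
    size_t len = strlen(lock_trace);
    assert(len + strlen(op) + 4 <= sizeof lock_trace);
    snprintf(lock_trace + len, sizeof lock_trace - len, "%s:%c ", op, l->name);
}

bool fake_async_lock(fake_lock_t *l) { record("async", l); return !l->contended; }
void fake_wait(fake_lock_t *l)       { record("wait", l); }
void fake_lock(fake_lock_t *l)       { record("lock", l); }
void fake_unlock(fake_lock_t *l)     { record("unlock", l); }

/* Caller holds `later` and needs `earlier`, which precedes it in the
 * canonical order. On return the caller holds both locks. */
void lock_against_order(fake_lock_t *earlier, fake_lock_t *later) {
    if (fake_async_lock(earlier))
        return;                  /* granted at once; `later` is still held */
    fake_unlock(later);          /* safe: already queued on `earlier` */
    fake_wait(earlier);          /* ownership is handed over here */
    fake_lock(later);            /* canonical order, so blocking is fine */
}
```

At every point in lock_against_order the thread either holds a lock or sits on the waitq of one, which is the property a plain trylock-then-retry sequence cannot provide.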



3 Internals of an adlist node 

Figure 3 shows the internal fields of an adlist node. The
prev and next pointers take up 16 bytes, the lwlock
takes up 4 bytes and the remaining 4 bytes are taken up
by an adl_refcnt_t structure. The adl_refcnt_t
structure is a custom synchronization primitive that is
built using the same general principles as other lwlock
primitives. The lock field protects the prev and next
pointers. The lock must be acquired in shared mode for 
reading either of the pointers and in exclusive mode for 
changing either of the pointers. We currently use a sin- 
gle lock to protect both the pointers but it is not nec- 
essary to do so. Using separate locks for the 2 pointers 
would allow greater concurrency at the expense of the 
additional overhead. For now a single lock has sufficed 
for our needs. The adlist structure itself contains 2 fixed 
nodes, the head and tail nodes between which all user 
added nodes are kept. 

3.1 adl_refcnt_t

The adl_refcnt_t structure provides synchroniza- 
tion between read-only and modifying operations on an 
adlist. As can be seen in figure 3, its internal structure is
somewhat analogous to that of an lwlock. Like an lwlock,
the structure is also manipulated using CAS instructions. 
The behavior, however, is quite different. We now go 
over the details of each of the fields. 



struct adl_refcnt_t {
    uint1_t  mask;
    uint15_t pincount;
    uint16_t waitq;
} // 4 bytes

struct adl_node_t {
    adl_node_t   *next;
    adl_node_t   *prev;
    lwlock_t     lock;
    adl_refcnt_t refcnt;
} // 24 bytes

Figure 3: Internal structure of an adlist node. 
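
The sizes quoted in Figure 3 can be checked with a small C rendering of the node. This is only a sketch: it assumes an LP64 platform such as x86-64 with GCC or Clang, uses bit-fields (whose layout is implementation-defined) to stand in for the 1-, 15- and 16-bit fields, and represents the lwlock as a bare 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t lwlock_t;            /* 4-byte reader-writer lwlock word */

typedef struct {
    uint32_t mask     : 1;
    uint32_t pincount : 15;
    uint32_t waitq    : 16;
} adl_refcnt_t;

typedef struct adl_node {
    struct adl_node *next;
    struct adl_node *prev;
    lwlock_t         lock;
    adl_refcnt_t     refcnt;
} adl_node_t;

/* These hold on common LP64 ABIs; other platforms may differ. */
static_assert(sizeof(adl_refcnt_t) == 4, "refcnt should pack into 4 bytes");
static_assert(sizeof(adl_node_t) == 24, "16 bytes of pointers + 8 of sync");
```

A CAS-based implementation would more likely keep the refcnt as a plain 32-bit word and extract the fields with masks, since bit-fields cannot be updated atomically as a unit.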



3.1.1 Mask bit 

The mask bit controls the visibility of a node in the adlist. 
If the mask bit is set the node is considered to be invis- 
ible. Iterators will skip over this node and not return it 
to the upper layers. An attempt to set the mask when it is
already set returns an error to the caller. Hence the act
of setting or unsetting the mask acts as a serialization 
point between an iteration and delete operation, between 
two delete operations, and between an iteration and insert 
operation. 

A node delete operation starts by first setting the mask 
bit. If two threads attempt to delete the same node, then 
only the thread that sets the mask will continue with the 
delete while the other will return failure to the caller. 
The moment when the mask is set is considered to be 
the moment that the node was removed from the list. 
Any iterator reaching the node after that will skip it
as if it does not exist, effectively serializing the operations
as delete first, then iterate. If an iterator had already returned
the node earlier, then the serialization is iterate
first, then delete. Note that setting the mask only marks 
the point in time when the delete took place. The actual 
deletion of the node which involves updating the point- 
ers of the neighboring nodes still needs to be done and it 
has to co-ordinate with any other competing operations 
occurring in the same neighborhood. Also note that once 
a node has been masked for delete, the caller must even- 
tually go through with the delete. There is no option to 
abandon the intent to delete by clearing the mask bit. 
This restriction is mainly to keep things simple and we 
haven't found a compelling use case to support the aban- 
don option. 

When a node is inserted into the list, its mask bit is set 
until all the pointers are set up. Only then is the mask bit 
cleared. The moment of clearing the mask bit is consid- 
ered to be the moment that the insert actually occurred 
and is the serialization point between an insert and an 
iteration operation. 
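
The mask bit's role as a serialization point can be sketched with a CAS loop. The bit position and the function names refcnt_try_mask, refcnt_unmask and refcnt_is_masked are assumptions for illustration and are not the adlist library's API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MASK_BIT 0x1u     /* assumed position of the mask bit */

typedef _Atomic uint32_t adl_refcnt_t;

/* Set the mask; fails if some other thread set it first, so at most
 * one of several competing deleters proceeds. */
bool refcnt_try_mask(adl_refcnt_t *r) {
    uint32_t o = atomic_load(r);
    do {
        if (o & MASK_BIT)
            return false;             /* already masked: lost the race */
    } while (!atomic_compare_exchange_weak(r, &o, o | MASK_BIT));
    return true;
}

/* Clear the mask at the end of an insert; the node becomes visible. */
void refcnt_unmask(adl_refcnt_t *r) {
    uint32_t prev = atomic_fetch_and(r, ~MASK_BIT);
    assert(prev & MASK_BIT);          /* only a masked node can be unmasked */
    (void)prev;
}

bool refcnt_is_masked(adl_refcnt_t *r) {
    return (atomic_load(r) & MASK_BIT) != 0;
}
```

The instant at which the CAS in refcnt_try_mask succeeds corresponds to the moment of deletion, and the atomic clear in refcnt_unmask corresponds to the moment of insertion.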
3.1.2 Pin count

The value in the pincount field determines whether 
a node can be removed from the adlist or not. A non- 
zero value means that the node cannot be removed. 
The adl_node_pin API can be called to increment 
the pincount and adl_node_unpin can be called 
to decrement it. The unpin call will always succeed un- 
less the pincount is already zero (which will cause a 
crash). The adl_node_pin call only succeeds if the mask
bit is not set; otherwise it returns FALSE. Any caller of
adl_node_pin must be able to handle the failure to pin.
There is an internal function, adl_node_force_pin,
that can increment the pincount even when the mask
is already set, provided the existing value is already non-zero.
This function is not exposed outside the adlist library and 
is used internally to give priority to iterators over a delete 
operation. 
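
The three pin rules can be sketched on a bare 32-bit word as follows. The bit positions are assumed, and the functions are named refcnt_pin, refcnt_force_pin and refcnt_unpin to make clear that they are illustrative stand-ins operating on the refcnt word, not the adl_node_ calls themselves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MASK_BIT  0x1u                 /* assumed bit positions */
#define PIN_ONE   0x2u                 /* pincount assumed in bits 1-15 */
#define PIN_FIELD 0xFFFEu

typedef _Atomic uint32_t adl_refcnt_t;

/* Shared CAS loop. A masked node refuses the pin, unless `force` is set
 * and the node is already pinned by someone else. */
static bool pin_if(adl_refcnt_t *r, bool force) {
    uint32_t o = atomic_load(r);
    do {
        bool masked = (o & MASK_BIT) != 0;
        bool pinned = (o & PIN_FIELD) != 0;
        if (masked && !(force && pinned))
            return false;
        assert((o & PIN_FIELD) != PIN_FIELD);   /* pincount is full */
    } while (!atomic_compare_exchange_weak(r, &o, o + PIN_ONE));
    return true;
}

bool refcnt_pin(adl_refcnt_t *r)       { return pin_if(r, false); }
bool refcnt_force_pin(adl_refcnt_t *r) { return pin_if(r, true); }

/* Drop one pin; returns true when this was the last pin. */
bool refcnt_unpin(adl_refcnt_t *r) {
    uint32_t o = atomic_fetch_sub(r, PIN_ONE);
    assert(o & PIN_FIELD);             /* unpinning at zero is a bug */
    return (o & PIN_FIELD) == PIN_ONE;
}
```

The return value of refcnt_unpin is a convenience of this sketch: it identifies the thread that dropped the last pin, which is the thread that would have to wake a delete waiting for the count to reach zero.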

Iterators pin the node they are on before returning it to 
the caller. The adl_iter_next API returns the "next" 
node based on the direction the iterator was set up with. 
There are two possible directions: (i) forward where the 
next field of a node is followed; and (ii) backward 
where the prev field is followed. Maintaining a pin on 
the returned node ensures that the iterator's foothold (the
node) will stay in the adlist until the iterator is invoked 
again to get the next node. Invoking adl_iter_next 
also implicitly releases the pin on the node returned ear- 
lier. 

3.1.3 Waitq 

The waitq is used to co-ordinate between a delete op- 
eration that is trying to remove the node from the list and 
any iterators that are moving backwards and need to pass 
over the node. 

Once a delete operation has set the mask bit, it cannot 
start on the process of actually removing the node unless 
the pincount on the node reaches zero. The deleting 
thread places its waiter's id in the waitq field when it is 
waiting for the pincount to reach zero. The last thread 
to call adl_node_unpin invokes the signal call of 
the delete thread's waiter. 

An iterator uses the waitq when it has to wait for a 
delete operation. Let's say an iterator that is going back- 
wards is on node A. Let's say the node before A is B and 
C is the one before B. Also, B is already masked while 
C is not masked. The iterator therefore has to skip over 
B and return C when adl_iter_next is called. It also 
needs to drop the pin it has on A and acquire one on C 
before returning it. To skip over B, the iterator has to first 



temporarily pin it. Only then can the iterator safely assume
that B will not be removed while it tries to acquire
the lock to read B's prev pointer. Since B is already
masked, the iterator tries the adl_node_force_pin
call to increment the pincount. If the call succeeds, the
iterator can drop the lock and pin on A and restart the
operation as if it was always on B and is going to C. If the
call fails, then the delete operation had already been sig- 
naled to proceed and the iterator must wait for the delete 
to finish. It does so by adding its waiter to the waitq 
of B before dropping the lock on A. When the delete op- 
eration has finished, it will signal all the waiters on the 
waitq of B. At this point, B is no longer on the list and 
A's prev points to C. The iterator still has its pin on A.
Hence, it can basically restart the adl_iter_next op- 
eration. 

Note that the use of the adl_node_force_pin call implicitly
gives priority to iterators over the delete operation.
This behavior is not necessary but the rationale is that the 
forced pin is expected to be released relatively quickly 
as the iterator tries to skip over the masked node. It does 
expose the delete operations to the possibility of being 
starved but we have found it to be satisfactory in prac- 
tice. Systems where this would be unacceptable can sim- 
ply make the iterators wait if the node is already masked. 



4 Adlist APIs 

We have already seen hints of the APIs supported by 
adlists. We now present them systematically and prove 
that the APIs work reliably when there is contention. The 
APIs supported by adlists can be broadly divided into 3
categories: (i) operations that add nodes to the list, (ii)
operations that remove nodes from the list, and (iii) operations
that iterate over the list. Table 1 shows the available
APIs in each category. APIs that are really convenience
wrappers around other calls are marked with an
asterisk. Aside from these, there are also the aforemen- 
tioned adl_node_pin and adl_node_unpin calls to 
manage a node's removability in the adlist. 

Regardless of the operation type of an API call, they all 
follow some basic rules. All APIs that take a node as an 
input need a guarantee that the node is on the list. This 
is usually achieved by having the caller take a pin on the 
node, either directly or implicitly due to an earlier action, 
although having the node masked would also work. The 
only exception, naturally, is the node being added in in- 
sert operations which is initialized internally and added 
to the list. Functions that return a node as their result al- 
ways return the node in pinned state. Insert operations 
also leave the new node in pinned state. It is the caller's
responsibility to drop the pin once it is no longer needed.
Convenience wrappers like iterators drop the pin internally
on nodes that no longer need to be pinned.

Category  API                             Comment
--------  ------------------------------  ------------------------------------
iterate   adl_next                        -
          adl_prev                        -
          adl_first*                      -
          adl_last*                       -
          adl_iter_init*                  specify direction
          adl_iter_next*                  -
          adl_iter_destroy*               releases pin on current node, if any
insert    adl_insert_after                -
          adl_insert_before               -
          adl_insert_at_front*            -
          adl_append_at_end*              -
delete    adl_node_remove_start           mask node
          adl_node_remove_waitonpincount  wait to start delete
          adl_node_remove_do              update pointers
          adl_node_delete*                wrapper around above three
          adl_pop*                        remove first node
          adl_dequeue*                    remove last node
          adl_iter_pop*                   remove current iter node

Table 1: Supportable APIs of adlists.

The agreed upon "canonical order" of locks in the list is 
along the next pointer of the nodes. Any lwlocks ac- 
quired in the canonical order can be acquired in blocking 
mode. Any locks acquired in the reverse order must be
acquired using asynchronous locking, with proper handling
of cases where the asynchronous lock doesn't succeed
right away. In other words, if a node B can be reached 
from node A by following one or more next pointers, 
then the lwlock of node B can be acquired in block- 
ing mode while holding the lwlock of A. To acquire the 
lwlock of A while holding the lwlock of B, we must 
use the async_lock call. The specific details of how 
to handle the case when async_lock returns FALSE 
vary slightly from API to API but the general essence
is to drop the lwlock on B, then wait for the lock on A 
to be granted. Once acquired, the lock on B can be re- 
acquired in blocking mode. The natural questions to ask 
here are how do we ensure that (i) the node B remains 
valid when we are ready to re-acquire the lock and (ii) 
that the node B remains in a position after node A so that 
a blocking lock on node B won't violate the canonical or- 
der. We hope to convince the readers in the next portion 
that we do indeed maintain these properties. 

All the API calls are orchestrated carefully so that the 
above requirements are always honored regardless of 
what else happens. There are multiple means by which 
one can ensure that a node stays on the list: having a
pin on the node prevents it from being removed; setting
the mask bit ensures that no other thread will try to remove
the node; and taking the lwlock, in either shared or exclusive
mode, of a node or either of its neighbors ensures
that the node cannot be removed, as the relevant pointers
cannot be updated without acquiring these locks. Starting
with the fact that an input node is either pinned or
masked, each API call uses one or more of these guar- 
antees in a careful dance with all the other ongoing con- 
tending operations to ensure correctness when accessing 
any nodes required for the call. 

4.1 API Implementation 

Before we go into the contention cases to illustrate 
how correctness is maintained, let us look at the gen- 
eral structure of some selected core API operations. We 
will only look at operations that need to take locks in 
non-canonical order. Operations such as adl_next and
adl_insert_after, which only take locks in canonical
order, are relatively straightforward.

adl_prev. The adl_prev call takes a node as an input
and returns the first node before it which is not masked. 
The input node is guaranteed by the caller to remain on 
the list either because it is pinned by the caller or was 
masked for delete by the caller. When there is no con- 
tention, the operation will proceed as follows: (i) Ac- 
quire lock of input node in shared mode, (ii) Read the 
prev pointer, (iii) Call adl_node_pin to pin the pre- 
vious node, (iv) Release lock of input node, (v) Return 
previous node. 

adl_insert_before. The adl_insert_before call
takes an existing node and a new node as input and in- 
serts the new node before the existing node. The existing 
node must be pinned or masked by the caller. In the no 
contention case, the operation proceeds as: (i) initialize 
the lock and refcnt of the new node. The refcnt is 
initialized as masked and pinned, (ii) Take exclusive lock 
on input node, (iii) Read prev pointer of input node, 
(iv) Use async_lock to take exclusive lock on previ- 
ous node, (v) Setup the needed pointers to add new node 
in between input and previous nodes, (vi) Unmask new 
node to make it visible. It is left pinned, (vii) Release 
locks on previous and input nodes and then return. 

adl_node_delete. The adl_node_delete operation
deletes the input node from the list. Unlike the other
APIs above, the input node must be pinned, not masked,
by the caller beforehand. It is actually a wrapper around
the three-step node removal process. Step 1 is the call
to adl_node_remove_start, which tries to set the
mask bit of the node. The function can return one of
three possible outcomes: (i) The mask could not be set.
Return FALSE to caller. (ii) The mask is set and the pincount
was also decremented to 0. Go directly to step 3. (iii)
The mask is set but the pincount is not 0. The pin held
by the caller is not internally dropped in this case as was
done in case (ii). The call needs to go through step 2.

Step 2 is the adl_node_remove_waitonpincount 
function which drops the pin held by the caller and if 
the pincount is still non-zero, puts the caller's waiter 
in the waitq. This is done in a single CAS instruction. 
If the waiter was queued, it then calls the wait func- 
tion on its waiter. Upon waking up from the waiter, the 
pincount has reached 0, and the delete can proceed to 
step 3. 
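
The state transitions of steps 1 and 2 can be modelled as pure functions on a single refcnt word. The following Python sketch is illustrative only. The bit layout (`MASK_BIT`, `WAITER_BIT`, `PIN_MASK`) is an assumption, the waitq is reduced to a single flag bit, and we assume that the mask "could not be set" exactly when the node is already masked. In a real implementation each returned value would be installed with a CAS retry loop; the functions below only compute the value that the CAS would install.

```python
MASK_BIT = 1 << 31          # assumed: top bit marks the node as masked
WAITER_BIT = 1 << 30        # assumed: stands in for "a waiter is queued in the waitq"
PIN_MASK = WAITER_BIT - 1   # assumed: low 30 bits hold the pincount

def remove_start(word):
    """Step 1: return (new_word, outcome) for a caller that holds one pin."""
    if word & MASK_BIT:
        return word, "failed"                  # (i) mask could not be set
    pins = word & PIN_MASK
    assert pins >= 1, "caller must hold a pin"
    if pins == 1:
        return MASK_BIT, "step3"               # (ii) masked, caller's pin dropped
    return word | MASK_BIT, "step2"            # (iii) masked, caller's pin kept

def remove_wait_on_pincount(word):
    """Step 2: drop the caller's pin; return (new_word, must_wait)."""
    assert word & MASK_BIT and (word & PIN_MASK) >= 1
    pins = (word & PIN_MASK) - 1
    if pins == 0:
        return MASK_BIT, False                 # last pin gone, go to step 3
    return MASK_BIT | WAITER_BIT | pins, True  # other pins remain: queue and wait
```

Because the pin drop and the waiter registration in step 2 are folded into one new word, a concurrent unpin cannot slip in between them, which is the point of doing both in a single CAS.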

Step 3 is the adl_node_remove_do function which 
does the work of updating the pointers to remove the 
node from the list. In the no contention case, the op- 
eration proceeds as follows: (i) Take exclusive lock on 
node, (ii) Read the prev pointer and take exclusive 
async_lock on the previous node, (iii) Read the next 
pointer and take exclusive lock on the next node, (iv) Up- 
date the relevant pointers and release the locks on the 
previous and next nodes. The node is now off the list 
and cannot be reached by anyone via the list, (v) "Clear" 
the deleted node which involves waking up any thread 
that was waiting either on the lock or the waitq of 
the node. The iterators waiting on the waitq are simply 
signaled. The threads waiting on the lwlock are expect- 
ing to have the lock transferred to them. To wake them 
up properly and wait for them to clear out, the function 
calls async_lock on the node's lwlock that it holds it- 
self already. This call will always result in the thread's 
waiter getting queued on the lwlock. The lwlock is then 
released which transfers it to the first waiting thread. As 
the waiting threads get the lock one by one, each realizes 
the node is no longer of interest and passes the lwlock 
to the next waiting thread. The lwlock eventually comes 
back to the thread doing the delete at which point it can 
be sure that there are no threads in any way trying to ac- 
cess the node. The delete operation is now complete and 
the node can be disposed of by the caller in any way it
sees fit. 



4.2 Dealing with Contention 

We now look at how the various APIs handle contention. 
The simplest case is the iterator API which we discuss 
first. Looking at the adl_prev implementation, the only 
problematic contention case it can run into is in step (iii)
when the previous node is already masked. The outline 
of how we handle this was covered in section 3.1.3 on
the waitq structure. Let us examine the correctness aspect
of the outline. Using the same example and nomenclature:
the iterator is on node A, B is masked and before
A, and C is before B. Since the iterator has node A
locked, we know that node B cannot move away. When
the adl_node_force_pin call fails, the iterator has to
add its waiter to the waitq of B. This must be done before
the delete operation has started clearing node B (step
(v) of adl_node_remove_do). As the iterator has node
A locked, we know the delete operation hasn't even got 
to the point of updating the pointers to remove B from the 
list. Hence the step of adding the waiter to the waitq is 
safe. The iterator will then drop the shared lock on A, 
thereby stepping out of the way of the delete operation, 
before calling wait. The iterator still has a pin on node 
A so that the node will stick around on the list while the 
iterator waits for the delete of B to finish. 

Next, consider the case of insert and delete operations. 
The problem case of contention in both comes in the 
async_lock step where the function tries to take ex- 
clusive locks in non-canonical order. The basic operation 
of taking locks in reverse order under contention was explained
in section 2.2. Using the nomenclature of that
section, let us call the input node B and the one before
it A. To recap, when the async_lock call returns
FALSE due to contention, the function drops the exclu- 
sive lock on B. It then waits for the lock on node A to 
be granted. Upon receiving the lock on A, the lock on 
node B is reacquired. This is done in a regular blocking 
(or synchronous) lock call. 

For correctness, we need to show that the input node B 
is still valid and that the lock order constraint is not vio- 
lated. We know that B is still valid as it was pinned upon 
entering the function and remains pinned (or masked) 
throughout. Hence, the lock call will not be making an 
illegal memory access. Initially, node A was before node
B. By now, either (a) A is off the list, if the contention was
with a delete operation; or (b) one or more nodes have
been added between A and B. In case (a), the contending
delete operation was stuck as long as the lock on B was 
held. The async_lock call put the thread's waiter in 
line to get the lock on A before stepping out of the delete
operation's way. Therefore, when the function finally gets
the lock on node A, the node is off the list and the contending
delete operation is in the clear node stage (step (v))
of the adl_node_remove_do function. Since A is now off
the list, there is no lock order restriction on it with re- 
spect to node B (or any other node on the list). This is 
because no other thread can get to node A anymore. In 
case (b), new nodes have been added in between but the 
nodes are still in the canonical order. The input node B 
is still ahead of node A and reachable by following the 
chain of next pointers from A. Node A could not have 
moved ahead of B as that would require deleting node 
A from the list first. But as we saw in case (a), the delete 
operation on A would have waited in the clear node stage 
for this operation and there would be no lock order violation.

From the preceding paragraph, we can see that the 
async_lock call will not run into any illegal memory 
access or deadlock situation. Since the list could have 
changed, once the lock on node B is reacquired, we do 
recheck the prev pointer of B to see if it still points to 
A (which is also locked by the function at this point). If 
the pointer has changed due to a delete or an insert, then 
the lock on A is released and the operation is back in its 
earlier state with only the input node pinned/masked and 
locked. The operation resumes from that state. 



5 Observations and Extensions 

We now present some observations on the expected behavior
of adlists, some of which have also been corroborated
by empirical experience of using them. The adlist
construct was developed to address the scalability issues 
of a regular doubly-linked list being seen in the Data Do- 
main File System. It is part of the next major release of 
the Data Domain OS and has been in active use inter- 
nally for over a year. It is used in, among other things, 
simple list based LRU caches, maintaining lists of open 
connections, open streams and lists of pending IOs. 

(i) Since the node delete operation has to wait for the 
pincount of the node to drop to 0, the thread calling 
the delete on a node should not hold a pin on any other 
node. Otherwise, it can get into a circular dependency 
with another thread and result in a deadlock. The moti- 
vation behind splitting the delete node API into 3 steps 
was to allow a thread to take any action necessary before 
it blocks to wait for the pincount to reach 0. 

(ii) Since any operation on an adlist involves more steps 
and has to involve more locks than a traditional single 
mutex approach, replacing a doubly-linked list with an
adlist only makes sense for lists seeing a fair amount of 
contention. 

(iii) Since each operation only acquires local locks, each 
operation is only guaranteed local consistency. For ex- 
ample, an iterator could see the same node twice if an- 
other thread removes it from the list after the iterator 
has crossed it and inserts it back in front of the iterator. 
Similarly, an iterator can fail to see a node if it moves 
backward in the list. We have found this to be accept- 
able in most use cases. For lists that need globally con- 
sistent view for certain operations, an adlist needs to be 
extended with another reader-writer lock. All operations
that do not care for a globally consistent view but can
modify the list must acquire the reader lock. An operation
that requires a globally consistent view must take the
writer lock. An iterator that doesn't care about a globally
consistent view need not worry about the extra lock at
all.

(iv) An adlist shows the biggest improvement when the 
access pattern over the list is scattered or if the list is 
subject to frequent or long running iterations. If the bulk 
of the operations are concentrated at any particular lo- 
cation (head or tail for example), then the contention on 
the lock will reduce the performance to be not much bet- 
ter than a single mutex protected list. There are some 
remedies in special situations. For example, if a list is being
used purely as a queue or a stack, then the lock-free
versions of those data structures are a much better choice.
For use cases such as an LRU cache, the reclaim opera- 
tions tend to be limited in parallelism (sometimes only 
1 thread can perform it) while re-prioritize operations 
create pressure on the end of the list (say head) which 
has "recent" nodes. However, since the insert at head for 
recent nodes does not need to be strictly at the head it- 
self, we can use the following trick: The list is initialized 
with a certain number of special pre-populated dummy 
nodes in the head area. An insert at head picks one of the 
dummy nodes randomly and inserts the node after it. The 
number of dummy nodes used determines their impact. 
The count should be high enough to lower contention but 
not so high that skipping them during iterations becomes 
an issue. The dummy nodes are also periodically moved 
back as reclaims would otherwise slowly move them to- 
wards the tail. 
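
The dummy-node trick in (iv) can be sketched structurally as follows. This Python model is single-threaded and illustrative only: it shows how an approximate insert at head picks a random dummy node and how iteration skips dummies, and it leaves out the per-node locking and the periodic repositioning of dummy nodes. All names (`Node`, `link_after`, `make_list_with_dummies`, `insert_near_head`, `iterate`) are hypothetical.

```python
import random

class Node:
    def __init__(self, value=None, dummy=False):
        self.value, self.dummy = value, dummy
        self.prev = self.next = None

def link_after(node, new):
    """Splice `new` into the list immediately after `node`."""
    new.prev, new.next = node, node.next
    if node.next is not None:
        node.next.prev = new
    node.next = new

def make_list_with_dummies(count):
    """Build head -> d0 -> d1 -> ... and return (head, dummies)."""
    head = Node(dummy=True)
    dummies = []
    for _ in range(count):
        d = Node(dummy=True)
        link_after(dummies[-1] if dummies else head, d)
        dummies.append(d)
    return head, dummies

def insert_near_head(dummies, value, rng=random):
    """Approximate 'insert at head': insert after a randomly chosen dummy."""
    link_after(rng.choice(dummies), Node(value))

def iterate(head):
    """Yield the values of real nodes, skipping dummy nodes."""
    node = head.next
    while node is not None:
        if not node.dummy:
            yield node.value
        node = node.next
```

In a concurrent adlist, each such insert would contend only on the locks around the chosen dummy node, so the load at the head is spread over as many locations as there are dummy nodes.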

(v) As is evident from section 4.2, the contention resolution
process of adlist requires the "yielding" operation to
effectively restart. This means that the adlist operations 
are, in theory, prone to starvation. We have found this 
to be a non-issue in practice. Although we have some 
ideas on how to reduce or eliminate this possibility, we 
haven't pursued them owing to the added complexity and 
we leave out the details due to space constraints. 



6 Performance 



We wrote two benchmarks to evaluate the performance 
of an adlist and its extension that makes use of dummy 
nodes to lower contention in hot spots of the list. We 
compare adlists with single mutex protected doubly- 
linked lists, or "dlists" for short from now on. The first 
benchmark simulates a scattered workload assuming uni- 
form access to the lists. The second benchmark simulates 
the workload typically seen in LRU caches. We present 
the experimental evaluation for the two benchmarks in
sections 6.1 and 6.2 respectively.

Each experiment in the benchmarks was run 20 times
which was enough to get a confidence level of 99% on 
the presented average values. All the experiments were 
carried out on an Intel(R) Xeon(R) Processor E5504. It
has 8 cores, each one operating at 2.0GHz and with 4MB 
of cache. The machine has 12GB of DDR2 memory op- 
erating at 1333MHz. 

6.1 Uniform access to the lists 

In this benchmark each thread is concurrently inserting 
and removing local elements from random points of the 
list. A thread will only insert and remove its own ele- 
ments but it has to iterate over interleaving elements in- 
serted by other threads. Each thread inserts and removes 
batches of 128 elements and a thousand batches are given 
to each thread. Therefore, each thread inserts and re- 
moves 128,000 elements. This simulates the behavior of
a work-queue where individual requests come and go as 
they are received and processed. 

Figure 4 shows how the total time to run the benchmark
varies with the number of threads. As we throw in
more threads the contention increases linearly for dlists 
whereas for adlists it stays flat due to the finer-grain lock- 
ing provided by adlists. For this case the use of dummy 
nodes is not needed because the workload does not create 
any hot spots in the lists. 



Figure 4: Operations uniformly distributed (x-axis: number of threads, 1 to 10).



6.2 LRU workload 

In this benchmark we mimic workloads seen by LRU 
caches within the Data Domain File System. There is an 
array of n elements to be inserted so that it is straightforward
to have direct access to the elements. Each element
is identified by a serial number and has a boolean flag
to indicate whether it is present in the LRU cache. All 
threads carry out the same set of operations and only one 
of them can evict elements out of the LRU cache at a 
time. We call the evicting thread the "reclaimer". To ensure
we only measure the contention on the LRU, each thread 
uses a private lock-free stack to hold elements available 
to it. This eliminates contention that would otherwise 
show up from the shared pool of available elements. A 
thread may execute the following operations: 

(i) re-prioritize operation: an element is removed from 
its current position in the LRU list and moved to the 
head of the LRU list (highest priority point). 

(ii) insert operation: pop an element from the per-thread
lock-free stack of available elements and insert
it at the head of the LRU list.

(iii) evict operation: k elements are removed from the 
tail of the LRU list and pushed back into the per- 
thread lock-free stacks in a round-robin fashion. 
Each element has an eviction cost associated with it
which can be set when running the test. 

We have looked into three scenarios. The first one is 
when the cache is being warmed up and therefore each 
thread tries to find an element to re-prioritize and also 
inserts a new element. We call this workload "50% re-prioritize
and 50% inserts". In this workload each thread
randomly picks 100,000 elements, re-prioritizes them if
they are found in the LRU cache and also inserts another
100,000 elements. To have no contention due to eviction,
we make 100,000 elements available per thread.
This workload will lead to heavy contention on the head
of the LRU list. Figure 5 shows how the total time to
run the benchmark varies with the number of threads.
For this type of workload dlists and adlists behave similarly.
However, the extended adlists with dummy nodes
scale much better: the total time stays approximately
flat as we increase the number of threads
and keep the number of operations per thread constant.
We fixed the number of dummy nodes at 64. Later in 
this section we show how the number of dummy nodes 
affects the scalability. The main advantage of this exten- 
sion is that it keeps the simplicity of the code since a 
single LRU list has to be maintained instead of multi- 
ple LRU lists that would be the default extension using 
dlists. 

Figure 5: Warming up the LRU cache. The extended
adlist uses 64 dummy nodes. (Panel title: 50% re-prioritize and 50% inserts; x-axis: number of threads, 1 to 10.)

The second scenario is when eviction happens quite often
and its cost dominates the cost of re-prioritize operations.
In this workload each thread randomly picks 200,000 elements,
re-prioritizes them if they are found in the LRU
cache and also inserts another 20,000 elements. We call
this workload "90% re-prioritize and 10% inserts". To
create contention due to eviction we make 10,000 elements
available per thread (50% of the amount it needs
to insert). We also set the eviction cost at 50 microseconds
per element to simulate cases where reclamation
is expensive. The larger this cost, the larger the contention.
When eviction is triggered, the reclaimer thread
evicts k = 100 elements at a time. This workload leads
to contention due to a reclaimer thread iterating from the
tail towards the head for a considerable amount of time.
Also, most of the operations over the list are happening
at the head, which creates a hot spot.

Figure 6 shows how the total time to run the benchmark
varies with the number of threads. As the number
of threads increases we expect the total running time
to increase linearly for the three implementations (dlists,
adlists and extended adlists). That is because there can
be only a single reclaimer thread at any point in time. A dlist
presents the worst scalability because it can face contention
due to all three types of operations: re-prioritize,
insert and iteration due to eviction; whereas the adlists
do not face contention due to the latter. The best scalability
comes from an extended adlist with 64 dummy nodes
because it only faces contention due to a single reclaimer
thread that makes the other threads wait on the availability
of elements to insert back.

Figure 6: Contention due to the cost of evict operations
dominating the cost of re-prioritize operations. The extended
adlist uses 64 dummy nodes.

The third scenario is when the cost of re-prioritize operations
dominates the cost of evict operations. To achieve
that we increase the frequency of re-prioritize operations
compared to evict operations. In this workload
each thread randomly picks 2,000,000 elements, re-prioritizes
them if they are found in the LRU cache and
also inserts another 20,000 elements. We call this workload
"99% re-prioritize and 1% inserts". We keep the cost of
evict operations the same as in the previous scenario
by using the configuration described above.
As in the first scenario, this workload leads to heavy contention
on the head of the LRU list. Therefore dlists and
adlists behave similarly whereas extended adlists with 64
dummy nodes scale better. Figure 7 shows how the total
time to run the benchmark varies with the number of
threads for each of the three implementations.



Figure 7: Contention due to the cost of re-prioritize operations
dominating the cost of evict operations. The extended
adlist uses 64 dummy nodes. (Panel title: 99% re-prioritize and 1% inserts.)

We now evaluate how the dummy nodes affect the runtime
for a fixed number of threads. We use the same configuration
described above for scenario three and fix
the number of threads at 10. A single dummy node in an
extended adlist makes it equivalent to a normal adlist, and
all the threads are going to contend on the list's head. To
understand the impact of the number of dummy nodes on
the total runtime we can see this as the standard "balls
into bins" problem where m = 10 threads are thrown
into n bins (dummy nodes). The total runtime is inversely
proportional to the size of the bin with maximum load
(i.e., maximum number of contending threads). A tight
analysis for this problem is shown in [9] and our experimental
data matches their analysis. That is, the runtime
is O(t/log(n)) where t is the runtime for n = 1. Figure
8 shows both the experimental and theoretical curves.
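
The balls-into-bins view can be explored numerically. The following Python sketch is a Monte Carlo estimate of the mean maximum load when m threads each pick one of n dummy nodes uniformly at random. It is an illustration and not the analysis of [9]; the function name `mean_max_load`, the trial count and the fixed seed are assumptions chosen for reproducibility.

```python
import random
from collections import Counter

def mean_max_load(threads, bins, trials=2000, seed=0):
    """Estimate the expected size of the fullest bin.

    Each trial throws `threads` balls into `bins` bins uniformly at random
    and records the load of the most loaded bin; the mean over all trials
    is returned.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(bins) for _ in range(threads))
        total += max(counts.values())
    return total / trials

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, mean_max_load(10, n))
```

With a single bin every thread lands on the same dummy node, so the maximum load equals the thread count; adding bins lowers the estimate, with diminishing returns as n grows.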

Figure 8: Impact of dummy nodes in the runtime with
m = 10 threads.



7 Conclusions 

In this paper, we have demonstrated a way of building 
doubly-linked lists that support greater concurrency than 
traditional lists. Dubbed adlists, our lists do not place any 
restriction on the number or memory source of the mem- 
ber nodes. The overhead per node is only 8 bytes which 
we think should be acceptable in almost all scenarios. 
The low overhead and high concurrency are made possible
by the use of light-weight synchronization primitives
that provide a novel async_lock mechanism. Using
async_lock allows us to define mechanisms to acquire
locks in non-canonical order for a data structure.
Being able to do so is crucial to increasing concurrency. 
Although we have focused exclusively on the list data 
structures, our approach is general in nature and should 
be extensible to most other data structures. 